1 Introduction

This assignment examines how incomplete data (the observed data) affect our inferences compared with the inferences we would make from complete data (the true data). To investigate the effect that missing values have on model inferences, we fit an arbitrarily chosen multiple regression model to both data sets.

First, we provide descriptive statistics and correlations. Table 2.1 compares the first rows of the observed data and the true data, and Table 2.2 compares their means and variances. For the correlations we present two correlation matrices: one for the observed data (Table 2.4) and one for the true data (Table 2.5).

Second, we present our multiple regression model in Table 3.1. The outcome variable is active heart rate and the predictors are age and smoke, together with bmi, sex and the interaction between bmi and sex. The first three columns refer to the observed data, whereas the last three columns refer to the true data.

The research question we aim to answer with our model is: What impact do missing values have on the inferences drawn from an “active heart rate” model?

Third, we inspect the missing values and try to find out where they occur. In 4.1 we give a global overview of the missingness, and in 4.2 we compare the distributions of the observed data and the missing values.

Lastly, we perform t-tests on the variables that contain missing values to check what type of missingness we are dealing with, i.e. MNAR, MAR or MCAR. We also provide plots to visualize where the missing values occur.



2 Observed vs True data

In this section we will compare the observed with the true data set.

Table 2.1: Observed Data
age smoke sex intensity active rest height weight bmi
42 no female high NA 75 NA NA 22.4
31 NA male low NA 62 NA NA 23.8
36 no male low 109 76 182 78.0 23.5
31 no female low 78 62 164 53.9 20.0
42 no male low NA 66 189 NA 23.4
Table 2.1: True Data
age smoke sex intensity active rest height weight bmi
42 no female high 94 75 161 58.1 22.4
31 no male low 86 62 184 80.6 23.8
36 no male low 109 76 182 78.0 23.5
31 no female low 78 62 164 53.9 20.0
42 no male low 103 66 189 83.6 23.4

2.1 Descriptives

Obviously, neither the mean nor the variance of the variables age and rest changed, since they have no missing values.

The mean of active is also almost entirely unaffected. The variance of active changed a bit in the observed data, but this difference is simply due to sampling variability (we have deleted about 40% of the observations). The missing values in active are MCAR, so we would not expect any substantial changes in the marginal distribution of active.

The mean of height is also almost entirely unaffected. The variance of height changed a bit in the observed data, but this difference is simply due to sampling variability (we have deleted about 30% of the observations). The missing values in height are MCAR, so we would not expect any substantial changes in the marginal distribution of height.

The mean of weight is also almost entirely unaffected. The variance of weight changed a bit in the observed data, but this difference is simply due to sampling variability (we have deleted about 57% of the observations). The missing values in weight are MCAR, so we would not expect any substantial changes in the marginal distribution of weight.

The mean of bmi is also almost entirely unaffected. The variance of bmi changed a bit in the observed data, but this difference is simply due to sampling variability (we have deleted about 30% of the observations). The missing values in bmi are MCAR, so we would not expect any substantial changes in the marginal distribution of bmi.
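The claim that MCAR deletion leaves the marginal distribution roughly intact can be illustrated with a small simulation. This is only a sketch under our own assumptions: the values are simulated (mean 93 and standard deviation 19.6 loosely mirror active in Table 2.2), and the MNAR mechanism is a hypothetical contrast that does not occur in the report.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the sketch is reproducible

# Simulated "true" active heart rates; these are NOT the report's data.
true_values = [random.gauss(93, 19.6) for _ in range(10_000)]
cutoff = sorted(true_values)[len(true_values) // 2]  # approximate median

# MCAR: every value has the same 40% chance of being deleted.
mcar = [v for v in true_values if random.random() >= 0.4]

# MNAR (hypothetical contrast): high values are deleted far more often.
mnar = [v for v in true_values
        if random.random() >= (0.7 if v > cutoff else 0.1)]

shift_mcar = mean(mcar) - mean(true_values)
shift_mnar = mean(mnar) - mean(true_values)
print(f"shift in mean under MCAR: {shift_mcar:+.2f}")
print(f"shift in mean under MNAR: {shift_mnar:+.2f}")
```

Under MCAR the mean of the remaining values should differ from the true mean only by sampling noise, whereas the value-dependent deletion shifts it clearly downwards.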

Furthermore, the variances of age and rest are unaffected in the observed data set. The variance of active in the observed data set is .01 higher than in the true data set, and is thus almost entirely unaffected. However, height, weight, and bmi have a greater variance in the true data set than in the observed data set. This suggests that the missingness leads to a slight underestimation of the variance of these variables.

Table 2.2: Means and variances in the observed and true data sets
Variables \(M_{obs}\) \(M_{true}\) \(Var_{obs}\) \(Var_{true}\) \(N_{obs}\) \(N_{true}\)
Age 38.52 38.52 149.73 149.73 306 306
Active 92.58 93.13 383.05 383.04 183 306
Rest 69.83 69.83 120.78 120.78 306 306
Height 174.50 173.99 100.66 105.29 214 306
Weight 73.91 73.58 260.26 274.85 132 306
Bmi 24.11 24.06 12.91 13.38 213 306
Note. obs = Observed Dataset, true = True Dataset
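The comparison in Table 2.2 can be sketched for the five rows of active shown in Table 2.1. This is a minimal illustration in Python; the helper name `describe` is our own, and the report does not state how the descriptives were actually computed.

```python
from statistics import mean, variance

# 'active' for the five rows shown in Table 2.1; None marks a missing value.
active_obs = [None, None, 109, 78, None]
active_true = [94, 86, 109, 78, 103]

def describe(values):
    """Return (mean, sample variance, n) computed on the non-missing values."""
    present = [v for v in values if v is not None]
    return mean(present), variance(present), len(present)

m_obs, var_obs, n_obs = describe(active_obs)
m_true, var_true, n_true = describe(active_true)
print(f"observed: M={m_obs:.2f} var={var_obs:.2f} N={n_obs}")
print(f"true:     M={m_true:.2f} var={var_true:.2f} N={n_true}")
```

With only two observed values the estimates are of course very unstable; the full data set in Table 2.2 has 183 observed values for active.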

2.2 Categorical variables

The categorical variables in both data sets are smoke, sex and intensity. Smoke and sex both have two levels (“no” and “yes” for smoke and “male” and “female” for sex), while intensity has three levels (“low”, “moderate”, and “high”).

Although the number of observed values differs between the data sets, the differences between groups remain largely unchanged. There are more males than females in both data sets, as well as more non-smokers than smokers. Also, in both data sets more males than females reported smoking. Overall, the most frequently reported workout intensity in the two data sets is moderate, followed by low and high. This ordering holds for females, whereas males most often reported low intensity, followed by moderate and high.

Table 2.3: Proportions of the categorical variables in the observed and true data
sex smoke intensity \(Freq_{obs}\) \(Freq_{true}\)
male no high 0.040 0.046
female no high 0.060 0.059
male yes high 0.065 0.075
female yes high 0.056 0.046
male no moderate 0.113 0.108
female no moderate 0.149 0.147
male yes moderate 0.085 0.085
female yes moderate 0.073 0.062
male no low 0.165 0.190
female no low 0.129 0.124
male yes low 0.048 0.046
female yes low 0.016 0.013
Note. obs = proportions in the observed data, true = proportions in the true data.

2.3 Correlations

As shown in Table 2.4 and Table 2.5, the correlations between the variables in the observed data set differ slightly from the correlations between the variables in the true data. Although the majority of the correlations are almost identical, a few correlations differ in sign between the two data sets. For example, the correlation between smoke and age in the observed data set is positive (r = 0.01), albeit almost 0, while the correlation between these variables in the true data set is negative (r = -0.05).

However, the impact of missing data on the correlations appears to be small, as the differences in correlation coefficients between the two data sets are negligible. Although some correlations differ in sign between the data sets, these coefficients remain close to 0 and thus should not distort inferences made with the observed data set.

Table 2.4: Correlations of observed data
age smoke sex intensity active rest height weight bmi
age 1.00 0.01 -0.17 0.21 -0.49 -0.39 0.19 0.25 0.18
smoke 0.01 1.00 -0.09 -0.29 0.15 0.23 0.18 0.18 0.18
sex -0.17 -0.09 1.00 -0.09 0.11 0.06 -0.73 -0.68 -0.42
intensity 0.21 -0.29 -0.09 1.00 -0.37 -0.55 0.13 0.12 0.02
active -0.49 0.15 0.11 -0.37 1.00 0.56 0.00 0.01 0.05
rest -0.39 0.23 0.06 -0.55 0.56 1.00 -0.20 -0.12 0.06
height 0.19 0.18 -0.73 0.13 0.00 -0.20 1.00 0.78 0.34
weight 0.25 0.18 -0.68 0.12 0.01 -0.12 0.78 1.00 0.88
bmi 0.18 0.18 -0.42 0.02 0.05 0.06 0.34 0.88 1.00
Table 2.5: Correlations of true data
age smoke sex intensity active rest height weight bmi
age 1.00 -0.05 -0.17 0.21 -0.54 -0.39 0.20 0.23 0.20
smoke -0.05 1.00 -0.11 -0.31 0.18 0.27 0.17 0.25 0.24
sex -0.17 -0.11 1.00 -0.09 0.09 0.06 -0.72 -0.69 -0.47
intensity 0.21 -0.31 -0.09 1.00 -0.37 -0.55 0.12 0.06 0.01
active -0.54 0.18 0.09 -0.37 1.00 0.61 -0.10 0.02 0.09
rest -0.39 0.27 0.06 -0.55 0.61 1.00 -0.15 -0.04 0.05
height 0.20 0.17 -0.72 0.12 -0.10 -0.15 1.00 0.77 0.36
weight 0.23 0.25 -0.69 0.06 0.02 -0.04 0.77 1.00 0.87
bmi 0.20 0.24 -0.47 0.01 0.09 0.05 0.36 0.87 1.00
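How a correlation can be computed when some values are missing can be sketched as follows. We assume pairwise deletion here (only rows where both variables are observed are used); the report does not state which treatment of missing values was used for Table 2.4, and the function name `pairwise_corr` is our own.

```python
from math import sqrt

def pairwise_corr(x, y):
    """Pearson r using only the rows where both x and y are observed."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mx = sum(a for a, _ in pairs) / n
    my = sum(b for _, b in pairs) / n
    sxy = sum((a - mx) * (b - my) for a, b in pairs)
    sxx = sum((a - mx) ** 2 for a, _ in pairs)
    syy = sum((b - my) ** 2 for _, b in pairs)
    return sxy / sqrt(sxx * syy), n

# height and weight for the five rows shown in Table 2.1 (None = NA)
height_obs = [None, None, 182, 164, 189]
weight_obs = [None, None, 78.0, 53.9, None]
height_true = [161, 184, 182, 164, 189]
weight_true = [58.1, 80.6, 78.0, 53.9, 83.6]

r_obs, n_pairs_obs = pairwise_corr(height_obs, weight_obs)
r_true, n_pairs_true = pairwise_corr(height_true, weight_true)
print(f"observed: r={r_obs:.2f} (n={n_pairs_obs})")
print(f"true:     r={r_true:.2f} (n={n_pairs_true})")
```

In this excerpt only two complete pairs remain, so the observed correlation is trivially 1; this shows how strongly pairwise deletion can reduce the effective sample size.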

3 Regression

Table 3.1: Regression analysis of True (N=306) and Observed Data (N=155)
\(\beta_{obs}\) \(SE_{obs}\) \(p_{obs}\) \(\beta_{true}\) \(SE_{true}\) \(p_{true}\)
(Intercept) 78.444 14.34 0.000 80.384 9.03 0.000
age -0.809 0.11 0.000 -0.883 0.07 0.000
bmi 1.681 0.55 0.003 1.776 0.35 0.000
sexfemale 32.756 20.78 0.117 43.460 14.16 0.002
smokeyes 1.615 2.91 0.580 3.516 1.99 0.078
bmi:sexfemale -1.131 0.88 0.199 -1.674 0.60 0.006
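Reading off the rows of Table 3.1, the fitted model can be written out as below. The indicator coding (\(female_i = 1\) for sexfemale, \(smoke_i = 1\) for smokeyes) and the symbol names are our assumption based on the row labels. The second line works out the slope of bmi for females, which is the sum of the bmi coefficient and the interaction coefficient.

```latex
\[
\widehat{active}_i = \beta_0 + \beta_1\,age_i + \beta_2\,bmi_i + \beta_3\,female_i
                   + \beta_4\,smoke_i + \beta_5\,(bmi_i \times female_i)
\]
\[
\text{bmi slope for females} = \beta_2 + \beta_5:\qquad
1.776 - 1.674 = 0.102 \ (\text{true}), \qquad
1.681 - 1.131 = 0.550 \ (\text{observed})
\]
```

For males the bmi slope is simply \(\beta_2\), i.e. 1.776 in the true data and 1.681 in the observed data.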

3.1 Answering the research question

Table 3.1 shows several differences between the observed and the true data in the beta coefficients, standard errors and p-values. The model contains variables with missing values and an interaction effect. Although most beta coefficients are of similar size, the beta coefficients of the observed data set are systematically closer to zero than those of the true data set. This is especially the case for sexfemale, where the difference between the beta coefficients is about 10.7 (32.756 versus 43.460). When making inferences, the effect of sex on active heart rate would therefore be underestimated.

Regarding the standard errors, the missing data caused these to be systematically larger in the observed data set. Larger standard errors increase the probability of making a type II error, as is the case in our data set: they may have played a role in sexfemale and the interaction bmi:sexfemale turning non-significant. When making inferences with the model based on the observed data, these effects would wrongly be disregarded.

In conclusion, the missing data lead to larger standard errors and therefore to less precise estimates of the beta coefficients. Moreover, some p-values turn out non-significant, in part because of the underestimated beta coefficients. The model based on the observed data thus leads to inaccurate inferences.
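How much of the increase in the standard errors could be due to the smaller sample alone can be checked with a little arithmetic. The sketch below uses the values from Table 3.1 and the rule of thumb that standard errors scale roughly with \(1/\sqrt{N}\); that rule is a simplifying assumption on our part and ignores changes in the predictor distributions.

```python
from math import sqrt

# Standard errors copied from Table 3.1: (observed, true)
se = {
    "(Intercept)":   (14.34, 9.03),
    "age":           (0.11, 0.07),
    "bmi":           (0.55, 0.35),
    "sexfemale":     (20.78, 14.16),
    "smokeyes":      (2.91, 1.99),
    "bmi:sexfemale": (0.88, 0.60),
}
n_obs, n_true = 155, 306

# If only the sample size changed, SEs would grow by about sqrt(N_true / N_obs).
expected_ratio = sqrt(n_true / n_obs)
ratios = {term: obs / true for term, (obs, true) in se.items()}

print(f"expected from N alone: {expected_ratio:.2f}")
for term, ratio in ratios.items():
    print(f"{term:14s} {ratio:.2f}")
```

The observed ratios lie between about 1.46 and 1.59, slightly above the roughly 1.41 expected from the reduced sample size alone.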


4 Missingness

There are 540 missing values in total: 0 for age, sex, intensity and rest, 58 for smoke, 92 for height, 93 for bmi, 123 for active, and 174 for weight. Moreover, there are 132 completely observed rows, 15 rows with one missing value, 37 rows with two missing values, 52 rows with three missing values, 55 rows with four missing values, and 15 rows with five missing values.
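Counts like these can be obtained by tallying the missing values per variable and per row. The sketch below does this for the five observed rows from Table 2.1; it is an illustration in Python, not necessarily the code used for this report.

```python
from collections import Counter

columns = ["age", "smoke", "sex", "intensity", "active",
           "rest", "height", "weight", "bmi"]
# The five observed rows from Table 2.1; None stands for NA.
rows = [
    (42, "no", "female", "high", None, 75, None, None, 22.4),
    (31, None, "male", "low", None, 62, None, None, 23.8),
    (36, "no", "male", "low", 109, 76, 182, 78.0, 23.5),
    (31, "no", "female", "low", 78, 62, 164, 53.9, 20.0),
    (42, "no", "male", "low", None, 66, 189, None, 23.4),
]

# Missing values per variable.
per_variable = {c: sum(r[i] is None for r in rows) for i, c in enumerate(columns)}
# Number of rows that have 0, 1, 2, ... missing values.
per_row = Counter(sum(v is None for v in r) for r in rows)

print(per_variable)
print(sorted(per_row.items()))
print("total missing:", sum(per_variable.values()))
```

As in the full data set, the total can be checked in two ways: by summing the counts per variable, or by summing the number of missing values over the rows.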

The missingness in the data is non-monotone, because cases with a missing value on the variable with the fewest missing values (smoke) still have observed values on variables with more missingness (e.g., bmi). The missingness would be monotone if every case with a missing value on smoke also had missing values on all variables with more missingness (e.g., height). Interestingly, such a monotone pattern holds only for smoke and weight.
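The definition of a monotone pattern can be made concrete with a small check. The function name `is_monotone` and the toy rows are hypothetical and only serve as illustration; they are not taken from the data set.

```python
def is_monotone(rows, variables):
    """True if the missingness sets of the variables are nested (monotone)."""
    n_missing = {v: sum(r[v] is None for r in rows) for v in variables}
    # Order variables from fewest to most missing values.
    ordered = sorted(variables, key=lambda v: n_missing[v])
    for r in rows:
        seen_missing = False
        for v in ordered:
            if r[v] is None:
                seen_missing = True
            elif seen_missing:
                # Observed on a variable with more missingness although a
                # variable with less missingness is already missing.
                return False
    return True

# HYPOTHETICAL toy rows (None = NA).
monotone_rows = [
    {"smoke": "no", "height": 182, "weight": 78.0},
    {"smoke": "no", "height": 164, "weight": None},
    {"smoke": "yes", "height": None, "weight": None},
]
non_monotone_rows = [
    {"smoke": None, "bmi": 23.8},   # smoke missing, bmi observed
    {"smoke": "no", "bmi": None},
    {"smoke": "no", "bmi": None},
]

print(is_monotone(monotone_rows, ["smoke", "height", "weight"]))  # True
print(is_monotone(non_monotone_rows, ["smoke", "bmi"]))           # False
```

The second toy set mirrors the situation described above: a case with smoke missing but bmi observed is enough to break monotonicity.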


Figure 4.1: pattern of the missingness




4.1 Looking for the missingness

In this section we investigate whether the mean of the missing values differs significantly from the mean of the observed values. For the numeric variables this is done with an independent-samples t-test. To compare the mean of the missing values with the mean of the observed values, we computed a logical vector for each variable that has missing observations. These missingness vectors have the value TRUE for all missing entries and FALSE for all observed entries, and are used as grouping variable in the true data set. For smoke, which is a categorical variable, we use a \(\chi^2\) test.
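The procedure with the missingness vector as grouping variable can be sketched for the five rows of active from Table 2.1. We assume Welch's version of the two-sample t statistic here, since the report does not state which variant was used; the p-value is left out because the Python standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(group_a, group_b):
    """Welch's t statistic and approximate degrees of freedom."""
    va = variance(group_a) / len(group_a)
    vb = variance(group_b) / len(group_b)
    t = (mean(group_a) - mean(group_b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(group_a) - 1)
                           + vb ** 2 / (len(group_b) - 1))
    return t, df

# 'active' for the five rows of Table 2.1 (None = NA).
active_true = [94, 86, 109, 78, 103]
active_obs = [None, None, 109, 78, None]

# Missingness vector: True where the observed value is missing.
is_missing = [v is None for v in active_obs]

# Use it as grouping variable on the TRUE values.
observed_group = [v for v, m in zip(active_true, is_missing) if not m]
missing_group = [v for v, m in zip(active_true, is_missing) if m]

t, df = welch_t(observed_group, missing_group)
print(f"M_obs={mean(observed_group):.2f} M_mis={mean(missing_group):.2f} "
      f"t={t:.3f} df={df:.2f}")
```

A negative t value means that the observed values are on average lower than the values that went missing.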

Table 4.1: Difference in means of observed data and missing data
variables \(M_{Obs}\) \(M_{Mis}\) \(t\) \(p\)
Weight 73.90 73.17 0.381 0.704
Height 174.50 172.83 1.271 0.205
Bmi 24.11 23.95 0.336 0.737
Active 92.58 93.95 -0.606 0.545
Note. Obs = observed values, Mis = missing values (taken from the true data set).

smoke: \(\chi^2 =\) 1.154, \(p =\) 0.283


Figure 4.2: Comparing the distribution of the observed and true dataset

4.2 Missingness of weight

Table 4.2: Difference in means of missingness of Weight
variables \(M_{Obs}\) \(M_{Mis}\) \(t\) \(p\)
rest 69.56 70.17 -0.482 0.630
age 38.16 38.98 -0.590 0.556
height 174.28 175.39 -0.639 0.525
bmi 24.11 24.12 -0.012 0.990
active 91.54 96.81 -1.440 0.156

missing weight on sex: \(\chi^2 =\) 0, \(p =\) 1

missing weight on smoke: \(\chi^2 =\) 0.036, \(p =\) 0.848

missing weight on intensity: \(\chi^2 =\) 2.589, \(p =\) 0.274


Figure 4.3: Looking whether the missingness of weight is MAR

Judging by the p-values for the missingness of weight against the other columns, there is no significant relation between any of the columns and the missing values of weight.

The three bar plots visualise how the missing values of weight are distributed over the categorical columns. The first plot shows that there is almost no difference in missing weight values between males and females. The second plot likewise shows almost no difference in missing weight values between smokers and non-smokers. The third plot shows that the lower the intensity, the fewer missing values in weight can be expected.
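The quantity behind such a bar plot is the proportion of missing values per category. A minimal sketch for intensity and weight, using the five observed rows from Table 2.1 (the variable names are our own):

```python
from collections import defaultdict

# intensity and weight for the five observed rows of Table 2.1 (None = NA).
intensity = ["high", "low", "low", "low", "low"]
weight = [None, None, 78.0, 53.9, None]

counts = defaultdict(lambda: [0, 0])  # level -> [n_missing, n_total]
for level, value in zip(intensity, weight):
    counts[level][0] += value is None  # True counts as 1
    counts[level][1] += 1

prop_missing = {level: miss / total for level, (miss, total) in counts.items()}
print(prop_missing)
```

With so few rows the proportions are not informative in themselves; in the full data set each category contains far more cases.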

The five scatterplots visualise how the missing values are distributed over the remaining columns. In the first two plots, of weight against rest and age, there is a clear trend in which all values with a low weight are missing and everything above that is not. The next two plots, of weight against height and bmi, show the same pattern, but also a cluster of missing values where both columns have low values. The last plot, of weight against active, shows a clear trend in which low values on either column correspond to missing values, with a cluster where both columns have low values.

4.3 Missingness of height

Table 4.3: Difference in means of missingness of Height
variables \(M_{Obs}\) \(M_{Mis}\) \(t\) \(p\)
rest 69.93 69.60 0.242 0.809
age 38.66 38.18 0.320 0.749
bmi 24.11 24.11 -0.012 0.990
active 91.74 99.05 -1.535 0.137

missing height on sex: \(\chi^2 =\) 0, \(p =\) 1

missing height on smoke: \(\chi^2 =\) 0.111, \(p =\) 0.739

missing height on intensity: \(\chi^2 =\) 3.563, \(p =\) 0.168


Figure 4.4: Looking whether the missingness of height is MAR

Judging by the p-values for the missingness of height against the other columns, there is no significant relation between any of the columns and the missing values of height.

The three bar plots visualise how the missing values of height are distributed over the categorical columns. The first plot shows that there is almost no difference in missing height values between males and females. The second plot likewise shows almost no difference in missing height values between smokers and non-smokers. The third plot shows almost no difference in missing height values between high and moderate intensity, but fewer missing values in the low intensity category.

The four scatterplots visualise how the missing values are distributed over the remaining columns. In the first two plots, of height against rest and age, there is a clear trend in which all values with a low height are missing and everything above that is not. The next two plots, of height against bmi and active, show a clear trend in which low values on either column correspond to missing values, with a cluster where both columns have low values.

4.4 Missingness of Active

Table 4.4: Difference in means of missingness of active
variables \(M_{Obs}\) \(M_{Mis}\) \(t\) \(p\)
rest 69.02 71.03 -1.558 0.120
age 37.96 39.35 -0.963 0.337
height 174.41 174.77 -0.232 0.817
bmi 23.83 24.85 -1.883 0.062
weight 72.94 79.38 -1.948 0.059

missing active on sex: \(\chi^2 =\) 1.957, \(p =\) 0.162

missing active on smoke: \(\chi^2 =\) 0.293, \(p =\) 0.589

missing active on intensity: \(\chi^2 =\) 2.193, \(p =\) 0.334


Figure 4.5: Looking whether the missingness of active is MAR

Judging by the p-values for the missingness of active against the other columns, there is no significant relation between any of the columns and the missing values of active.

The three bar plots visualise how the missing values of active are distributed over the categorical columns. The first plot shows that females have fewer missing values in the active column than males. The second plot shows that smokers have slightly fewer missing values in the active column than non-smokers. The third plot shows almost no difference in missing active values between the moderate and low intensity categories, but more missing values in the high intensity category.

The five scatterplots visualise how the missing values are distributed over the remaining columns. In the first two plots, of active against rest and age, there is a clear trend in which all values with a low active are missing and everything above that is not. The next three plots, of active against height, bmi and weight, show a clear trend in which low values on either column correspond to missing values, with a cluster where both columns have low values.

4.5 Missingness of Bmi

Table 4.5: Difference in means of missingness of Bmi
variables \(M_{Obs}\) \(M_{Mis}\) \(t\) \(p\)
rest 69.84 69.81 0.021 0.983
age 38.35 38.90 -0.368 0.713
height 174.28 175.39 -0.639 0.525
active 92.12 95.11 -0.717 0.478

missing bmi on sex: \(\chi^2 =\) 0.019, \(p =\) 0.889

missing bmi on smoke: \(\chi^2 =\) 0, \(p =\) 1

missing bmi on intensity: \(\chi^2 =\) 1.476, \(p =\) 0.478

missing bmi on rest: \(t =\) 0.021, \(p =\) 0.983

missing bmi on age: \(t =\) -0.368, \(p =\) 0.713

missing bmi on height: \(t =\) -0.639, \(p =\) 0.525

missing bmi on active: \(t =\) -0.717, \(p =\) 0.478


Figure 4.6: Looking whether the missingness of bmi is MAR

Judging by the p-values for the missingness of bmi against the other columns, there is no significant relation between any of the columns and the missing values of bmi.

The three bar plots visualise how the missing values of bmi are distributed over the categorical columns. The first plot shows that there is almost no difference in missing bmi values between males and females. The second plot likewise shows almost no difference in missing bmi values between smokers and non-smokers. The third plot shows almost no difference in missing bmi values between the moderate and low intensity categories, but more missing values in the high intensity category.

The four scatterplots visualise how the missing values are distributed over the remaining columns. In the first two plots, of bmi against rest and age, there is a clear trend in which all values with a low bmi are missing and everything above that is not. The next two plots, of bmi against height and active, show a clear trend in which low values on either column correspond to missing values, with a cluster where both columns have low values.


4.6 Missingness of Smoke

missing smoke on sex: \(\chi^2 =\) 5.037, \(p =\) 0.025

missing smoke on intensity: \(\chi^2 =\) 1.722, \(p =\) 0.423

missing smoke on rest: \(t =\) 0.779, \(p =\) 0.438

missing smoke on age: \(t =\) -1.271, \(p =\) 0.208

missing smoke on height: \(t =\) -0.347, \(p =\) 0.731

missing smoke on bmi: \(t =\) -1.338, \(p =\) 0.188

missing smoke on weight: \(t =\) -0.785, \(p =\) 0.444


Figure 4.7: Looking whether the missingness of smoking is MAR

Judging by the p-values for the missingness of smoke against the other columns, there is only one significant relation: the one between the missing values of smoke and sex (\(p =\) 0.025).
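A test of missingness against a two-level variable such as sex comes down to a \(\chi^2\) test on a 2x2 table. The sketch below uses HYPOTHETICAL counts, since the underlying cross-table is not shown in this report. We also assume that a continuity (Yates) correction was applied, which would explain the results of exactly \(\chi^2 =\) 0, \(p =\) 1 in earlier sections; the report does not confirm this.

```python
from math import erfc, sqrt

def chi2_2x2(table, yates=True):
    """Pearson chi-square test for a 2x2 table; returns (statistic, p-value)."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals, col_totals = (a + b, c + d), (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction, capped so it never exceeds the difference.
                diff -= min(0.5, diff)
            stat += diff ** 2 / expected
    # For 1 degree of freedom the chi-square tail equals erfc(sqrt(x / 2)).
    return stat, erfc(sqrt(stat / 2))

# HYPOTHETICAL counts: rows = male / female, columns = smoke missing / observed.
table = [[30, 70], [15, 85]]
stat_plain, p_plain = chi2_2x2(table, yates=False)
stat_yates, p_yates = chi2_2x2(table, yates=True)
print(f"uncorrected: chi2={stat_plain:.3f}, p={p_plain:.3f}")
print(f"Yates:       chi2={stat_yates:.3f}, p={p_yates:.3f}")
```

When the observed counts are very close to the expected counts, the corrected statistic becomes exactly 0 and the p-value exactly 1.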

The seven bar plots visualise how the missing values of smoke are distributed over the other columns. The first plot shows that females have fewer missing values in the smoke column than males. The second plot shows almost no difference in missing values between the high and low intensity categories, but fewer missing values in the moderate intensity category. The third plot shows that the missingness of smoke is spread evenly over rest, with two spikes where rest is lower than 55 and higher than 90. Beyond these spikes there are no more missing values, except for one more spike where rest is 40. The fourth plot shows that the missingness of smoke is spread evenly over age, with a spike where age is higher than 60. The fifth plot shows that the further the height lies below or above 170-175, the more missing values there are in the smoke column, except when the height is around 205, where there are close to no missing values. The sixth plot shows that the missingness of smoke is spread evenly over bmi, except at the lowest and highest bmi values, where there are almost no missing values. The seventh plot shows that the missingness of smoke is spread evenly over weight, with one spike in the middle, between a weight of 70 and 80. At the lowest and highest weights there are almost no missing values.